[NPU]: optimize GEGLU implementation with flatten 1D approach #1031
base: main
Conversation
# TODO: we should find a better way to tune this. 1e4 is too large apparently
1e-2 if device != "npu" else 1e4,
Do you know which tensors couldn't pass with this tolerance? Gradients or inputs?
Thanks for the question. I double-checked which tensors require the large tolerance.
On NPU with bfloat16:
- Forward outputs (y1 vs y2) differ on the order of 1e2.
- Weight gradients (gate_proj / up_proj / down_proj) are also off on the order of 1e2.
- The largest discrepancy is in the input gradients: x1.grad vs x2.grad can reach the order of 1e4.
So the forward outputs and weight gradients are already numerically different at ~1e2, and the input gradients further amplify this difference.
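For context, the comparison behind these numbers is essentially of the following form (a sketch only; liger_mlp and ref_mlp stand for the Liger NPU MLP and a matching torch reference with copied weights, and the helper names are illustrative rather than the actual test code):

import torch

def max_abs_diff(t1, t2):
    return (t1.float() - t2.float()).abs().max().item()

def compare(liger_mlp, ref_mlp, x):
    # x: bf16 input on the NPU; both modules are assumed to hold identical weights.
    x1 = x.clone().requires_grad_(True)
    x2 = x.clone().requires_grad_(True)
    y1, y2 = liger_mlp(x1), ref_mlp(x2)
    print("output     :", max_abs_diff(y1, y2))
    y1.sum().backward()
    y2.sum().backward()
    for (name, p1), (_, p2) in zip(liger_mlp.named_parameters(), ref_mlp.named_parameters()):
        print(f"{name}.grad :", max_abs_diff(p1.grad, p2.grad))
    print("input.grad :", max_abs_diff(x1.grad, x2.grad))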
================================================================================
SUMMARY - Minimum atol needed for each tensor (rtol=1e-2):
================================================================================
output : min_atol=1e2 , max_abs_diff=2.048000e+03
gate_proj.weight.grad : min_atol=1e3 , max_abs_diff=2.048000e+03
up_proj.weight.grad : min_atol=1e2 , max_abs_diff=2.048000e+03
down_proj.weight.grad : min_atol=1e2 , max_abs_diff=2.048000e+03
input.grad : min_atol=1e4 , max_abs_diff=4.096000e+03
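For reference, the per-tensor min_atol values above can be reproduced with a simple sweep of this form (a sketch; the helper name is illustrative):

import torch

def min_atol(actual, expected, rtol=1e-2):
    # Sweep a log-spaced grid of atol values (1, 2, 4, 8 times powers of ten)
    # and return the smallest one for which assert_close passes at this rtol.
    for exponent in range(-6, 7):
        for mantissa in (1, 2, 4, 8):
            atol = mantissa * 10.0 ** exponent
            try:
                torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
                return atol
            except AssertionError:
                continue
    return float("inf")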
Also worth noting: the tolerances used here are consistent with the previous NPU GEGLU kernel implementation, so this change does not introduce new numerical error compared to the existing behavior on NPU.
The problem might be related to the NPU implementation of GEGLU. I guess it is performed in bf16, not fp32?
In Liger, the up projection is implicitly converted to fp32 before the GELU activation, and then converted back to bf16 before being multiplied with the gate projection.
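At the torch level, that pattern is roughly the following (a sketch with generic names a and b for the two projections; the real kernel is written in Triton):

import torch

def geglu_tanh_reference(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # GELU is evaluated in fp32, cast back to the original dtype, and only then
    # multiplied elementwise with the other projection.
    gelu = torch.nn.GELU(approximate="tanh")
    return gelu(a.float()).to(a.dtype) * b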
Run:
import torch
a = torch.randn(1024, dtype=torch.bfloat16).cuda() # replace .cuda() with equivalent method on npu
gelu = torch.nn.GELU(approximate="tanh")
# pure bf16
b = gelu(a)
# upcast to fp32 and downcast back
c = gelu(a.float()).to(torch.bfloat16)
# Test identical result
torch.testing.assert_close(b, c, rtol=0, atol=0)  # This is true on cuda
If the above script doesn't raise any error, showing that gelu is also performed in fp32 on NPU, then it's just a numerical issue in the GEMM, and the errors are amplified by the large intermediate size. I don't mind keeping a large atol in this case.
Otherwise, we probably want to modify the gelu part to match torch's result.
I tested the standalone GELU behavior on NPU with bf16 input:
import torch
import torch_npu  # Ascend adapter; registers the npu device and the .npu() methods
a = torch.randn(1024, dtype=torch.bfloat16).npu()
gelu = torch.nn.GELU(approximate="tanh")
b = gelu(a)
c = gelu(a.float()).to(torch.bfloat16)
torch.testing.assert_close(b, c, rtol=0, atol=0)
This test passes without error, which indicates that GELU on NPU is not computed in pure bf16, but matches the fp32 → bf16 behavior of PyTorch CUDA.
Therefore, the mismatch we observe in GEGLU is unlikely to be caused by the GELU implementation itself. It is more likely due to numerical differences in the bf16 GEMM, where errors can be amplified by large intermediate activations.
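To see that amplification in isolation, one can compare a bf16 matmul against an fp32 reference while growing the reduction dimension (a rough illustration, not part of the actual test suite):

import torch

# Rough illustration of how bf16 GEMM error grows with the reduction size K.
# Replace device="cuda" with the NPU equivalent (after `import torch_npu`) on Ascend.
for k in (256, 1024, 4096, 16384):
    a = torch.randn(128, k, device="cuda")
    b = torch.randn(k, 128, device="cuda")
    ref = a @ b                                   # fp32 reference
    out = (a.bfloat16() @ b.bfloat16()).float()   # bf16 GEMM
    print(f"K={k:6d}  max_abs_diff={(out - ref).abs().max().item():.3f}")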
Try rtol=1e-1 and rerun this analysis.
With pleasure. Below is a summary of the minimum atol required for each tensor when rtol=0.1:
================================================================================
SUMMARY - Minimum atol needed for each tensor (rtol=1e-1):
================================================================================
output : min_atol=1e1 , max_abs_diff=2.048000e+03
gate_proj.weight.grad : min_atol=4e2 , max_abs_diff=2.048000e+03
up_proj.weight.grad : min_atol=8e0 , max_abs_diff=2.048000e+03
down_proj.weight.grad : min_atol=1e1 , max_abs_diff=2.048000e+03
input.grad : min_atol=2e3 , max_abs_diff=4.096000e+03
================================================================================
- Refactor the Ascend GEGLU kernels to use a flattened 1D grid-stride loop pattern instead of the row-based tiling approach, for better performance (a rough sketch of the pattern is shown below)
- Simplify block size calculation using compute_default_tiling_strategy
- Align type conversion logic with the GPU version for consistency
- Update test tolerances for NPU bfloat16 (1e4) to handle precision differences
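For illustration, a flattened 1D grid-stride forward kernel has roughly the following shape (a Triton-style sketch assuming contiguous inputs; it is not the PR's actual kernel, and the fixed launcher numbers stand in for what compute_default_tiling_strategy would provide):

import torch
import triton
import triton.language as tl


@triton.jit
def geglu_tanh_fwd_flat(a_ptr, b_ptr, c_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Flattened 1D view: each program strides over the whole element range in
    # steps of num_programs * BLOCK_SIZE instead of owning a fixed row.
    pid = tl.program_id(axis=0)
    step = tl.num_programs(axis=0) * BLOCK_SIZE
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    for _ in range(0, tl.cdiv(n_elements, step)):
        mask = offsets < n_elements
        a = tl.load(a_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
        b = tl.load(b_ptr + offsets, mask=mask, other=0.0)
        # tanh-approximate GELU in fp32, cast back to b's dtype before the multiply
        t = 0.7978845608028654 * (a + 0.044715 * a * a * a)
        tanh_t = 1.0 - 2.0 / (1.0 + tl.exp(2.0 * t))
        gelu_a = 0.5 * a * (1.0 + tanh_t)
        tl.store(c_ptr + offsets, gelu_a.to(b.dtype) * b, mask=mask)
        offsets += step


def geglu_flat(a: torch.Tensor, b: torch.Tensor, num_programs: int = 32, block_size: int = 1024):
    # Hypothetical launcher; in the PR, the grid and block size come from
    # compute_default_tiling_strategy rather than fixed numbers like these.
    c = torch.empty_like(b)
    geglu_tanh_fwd_flat[(num_programs,)](a, b, c, a.numel(), BLOCK_SIZE=block_size)
    return c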
Hardware Type: Ascend 910B4
- run make test to ensure correctness
- run make checkstyle to ensure code style
- run make test-convergence to ensure convergence